ABSTRACT

The increasing importance of Web 2.0 applications during 
the last years has created significant interest in tools for ana- 
lyzing and describing collective user activities and emerging 
phenomena within the Web. Network structures have been 
widely employed in this context for modeling users, web re- 
sources and relations between them. However, the amount 
of data produced by modern web systems results in networks 
that are of unprecedented size and complexity, and are thus 
hard to interpret. To this end, community detection meth- 
ods attempt to uncover natural groupings of web objects by 
analyzing the topology of their containing network. 

There are numerous techniques adopting a global perspec- 
tive to the community detection problem, i.e. they operate 
on the complete network structure, thus being computation- 
ally expensive and hard to apply in a streaming manner. In 
order to add a local perspective to the study of the problem, 
we present Bridge Bounding, a local methodology for com- 
munity detection, which explores the local network topology 
around a seed node in order to identify edges that act as 
boundaries to the local community. The proposed method 
can be integrated in an efficient global community detection 
scheme that compares favorably to the state of the art. As a 
case study, we apply the method to explore the topic struc- 
ture of the LYCOS iQ collaborative question/answering ap- 
plication by detecting communities in the networks created 
from the collective tagging activity of users.
1. INTRODUCTION

Network structures (also called graphs in mathematical 
literature) and the associated analysis methods have long 
emerged as a valuable tool for modeling and analyzing the 
relations among objects in a variety of established scientific 
disciplines, e.g. social sciences and biology [21]. Recent 
years however have witnessed a substantial adoption of net- 
work analysis techniques in the field of computer science, 
and more specifically, in the modeling and analysis of mas- 
sive data sets produced by online information systems, such 
as Web 2.0 applications. 

In the field of network research, the problem of commu- 
nity detection has lately attracted significant interest since 
identifying the community structure of large networks can 
improve our understanding of the complex relations that ex- 
ist among their elements. The origins of this problem can be 
traced in the fields of citation study [13], bibliometrics [28] 
and social network analysis [25]. More recently, this prob- 
lem has been restated in the context of web graphs, i.e. the 
networks created from mapping the web hyperlink structure 
to the directed network model. Two seminal web commu- 
nity definitions were formulated by Kumar et al. [18] and 
Flake et al. [11]. According to the first, a community is a 
dense directed bipartite subgraph of the web graph [18] . The 
latter definition states that a community is a vertex sub- 
set of the graph such that each of its members has at least as 
many edges to other members of the community as it does to 



non-member vertices [11]. Although these two community
definitions are different, they both result in the formulation
of community detection as a problem of finding a partition
of a graph into subgraphs that maximizes some measure of
within-subgraph density.
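
As a small illustration of the second definition, the following sketch checks whether a vertex subset satisfies the "at least as many edges inside as outside" condition attributed to [11]. The function name, the adjacency-dictionary representation and the toy graph are assumptions made for this example only.

```python
def is_flake_community(adj, members):
    """Return True if every member has at least as many edges to other
    members as to non-member vertices (the condition quoted from [11])."""
    members = set(members)
    for v in members:
        inside = sum(1 for u in adj[v] if u in members)
        outside = len(adj[v]) - inside
        if inside < outside:
            return False
    return True


# Hypothetical toy graph: two triangles joined by the single edge 2-3.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
       3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}

print(is_flake_community(adj, {0, 1, 2}))  # one of the triangles
print(is_flake_community(adj, {2, 3}))     # the two endpoints of the joining edge
```

On this toy graph the triangle satisfies the condition, while the pair {2, 3} does not, since each of its vertices has two edges leaving the set and only one staying inside.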

Due to the extremely high complexity of providing an ex- 
act solution to the community detection problem for the 
complete network¹, several attempts have been made to derive
approximate solutions at reduced computational costs,
with some of the most efficient techniques having a complexity
of $O(n \log^2 n)$ [22] and $O(m + n)$ [29] for networks
of n nodes and m edges. Despite being very efficient, most 
of the existing approaches adopt a global perspective, i.e. 
they operate on the full network, in order to output the de- 
tected community structure. In practice, however, there is 
frequently a need to explore the network structure at a local 
level, e.g. in interactive network visualization [26] and infor- 
mation retrieval applications [27]. Such applications impose 
severe constraints on the response time of the underlying 
network analysis processes, thus prohibiting the use of global 
community detection methods. To date, only a few methods
have been proposed that can be used for community detection
at a local level [2, 29]. However, they are either unsuitable
for networks of scale-free topology (frequently emerging
in practice) [2] or are not local by design, thus not achieving 
maximum efficiency when applied as local [29]. 

The situation described above motivated us to introduce a 
methodology for performing community detection at a local 
level; we call the proposed methodology Bridge Bounding. 
Bridge Bounding initiates the community detection process 
from a seed node in the network and progressively attaches 
neighboring nodes to the community as long as the edges 
connecting these nodes do not act as boundaries. Thus, com- 
munity detection is formulated as a problem of identifying 
edges that act as community boundaries (which we also call
bridges, since they connect communities of the network to 
each other [8, p. 140]). This problem is tackled by means of 
local network topology functions, i.e. functions that examine 
the network structure around an edge (local network topol- 
ogy) and produce a measure of the extent that these edges 
act as bridges. An example of such a function is the edge 
clustering coefficient [24]. In that way, we ensure that the 
proposed approach has low complexity and at the same time 
is capable of precisely identifying community boundaries. 

In order to demonstrate the benefits of our approach, we 
applied it to both synthetic and real networks. As a first 
step, we validated Bridge Bounding by testing its perfor- 
mance on the known community structure of synthetic net- 
works and comparing it with the widely cited approach of 
Girvan and Newman [15]. The proposed method could suc- 
cessfully detect the synthetic communities across a variety 
of network generation parameters and achieved equivalent 
or better performance than the competing method, while 
being computationally much more efficient. Subsequently, 
we employed Bridge Bounding to explore the community 
structure of two tag networks, English and German, cre- 
ated by tags used to annotate questions in the LYCOS iQ 
question/answering application². A set of tag communities
consisting of semantically related tags was extracted, thus

¹The problem is believed to be NP-complete [23].
²We collected data from both the German
(http://iq.lycos.de) and the English (http://iq.lycos.co.uk)
versions of the application.



revealing the structure of topics associated with the collec- 
tive question-answering activity of users. The extracted tag 
community structure can be exploited for improved topic 
interest monitoring and automatic tag recommendation to 
users of the application. 

The rest of the paper is structured as follows. Section 2 
presents an overview of existing work in the field of commu- 
nity detection in complex networks. Subsequently, the for- 
mal description of the proposed community detection method- 
ology is presented in Section 3. Section 4 presents the results 
and insights we obtained by applying Bridge Bounding both 
to synthetic networks with known community structure and 
to the LYCOS iQ tag network. Finally, Section 5 summa- 
rizes the basic contributions of the paper and delineates our 
future work. 

2. RELATED WORK 

The problem of community finding in large complex networks
has attracted considerable research interest for some
time now. Its origins can be traced back to the first studies 
of the hyperlink structure of the web, e.g. to the observa- 
tion of Gibson et al. [14] that communities emerge sponta- 
neously around authoritative web pages which are identified 
by means of hub pages. Then, the works by Kumar et al. 
[18] and Flake et al. [11] formally defined and systematically 
tackled the problem of community detection. In the follow- 
ing, we provide a list of existing methods for community 
detection classified according to the approach they adopt. 
A more detailed discussion of existing community detection
methods is contained in the survey by Danon et al. [19]. 

Subgraph enumeration. Kumar et al. [18] consider 
communities as dense bipartite subgraphs of the web (seen 
as a directed graph). A natural way to identify dense sub- 
graph structures is by means of graph partition enumeration. 
In order to drastically reduce the vast number of subgraphs 
that are possible by complete enumeration, the authors employ
a series of heuristic pruning techniques. An extension
of this definition led to the notion of $\gamma$-dense communities
[9], which can be efficiently discovered based on more sophisticated
subgraph enumeration and pruning criteria.

Maximum flow. Flake et al. [11] define communities 
as subsets of vertices that have more links (undirected) to 
each other than to the rest of the network nodes. To detect 
such communities on the web, they integrate a maximum 
flow strategy with an iterative crawling process. A stricter 
community definition was considered by Ino et al. [17] and 
a technique was devised to detect them that was based on 
both the maximum flow algorithm and an iterative graph 
partitioning and contraction process. 

Divisive-Agglomerative methods. According to Gir- 
van and Newman [15], the community structure of a large 
network should be revealed by progressively removing edges 
with high edge betweenness, i.e. by following a divisive 
approach. Following the same approach but with the use 
of different measures, namely the edge clustering coefficient 
and the bridging centrality, Radicchi et al. [24] and Hwang 
et al. [16], respectively, could uncover the underlying com- 
munity structure of complex networks. Later, the measure 
of modularity was defined by Newman and Girvan, as a
means to quantify the quality of a network partition into 
communities [23]. More specifically, modularity reflects the 
extent to which a given network partition is characterized 
by higher intra-community density in comparison to the one 



that would be observed in a random partition of the same 
network. Building upon this measure, the methods by New- 
man [22] and Clauset et al. [6] describe efficient implemen- 
tations of community detection by means of agglomerative 
strategies. 
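
To make the modularity measure concrete, the sketch below computes it for an undirected, unweighted graph in the form $Q = \sum_i (e_{ii} - a_i^2)$ (the form that appears later as Equation 8). The function name, the edge-list input and the `community_of` dictionary are assumptions of this example and not part of any of the cited implementations.

```python
def modularity(edges, community_of):
    """Q = sum_i (e_ii - a_i^2) for an undirected, unweighted edge list.

    e_ii: fraction of edges with both endpoints in community i.
    a_i : fraction of edge ends attached to vertices of community i.
    """
    m = len(edges)
    e_in = {}
    a = {}
    for u, v in edges:
        cu, cv = community_of[u], community_of[v]
        if cu == cv:
            e_in[cu] = e_in.get(cu, 0.0) + 1.0 / m
        # Each edge contributes one half of its weight to each endpoint.
        a[cu] = a.get(cu, 0.0) + 0.5 / m
        a[cv] = a.get(cv, 0.0) + 0.5 / m
    return sum(e_in.get(c, 0.0) - a[c] ** 2 for c in a)


# Hypothetical toy graph: two triangles joined by the edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
two_groups = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}
one_group = {v: 0 for v in range(6)}
print(modularity(edges, two_groups), modularity(edges, one_group))
```

Splitting the toy graph into its two triangles yields a positive value, whereas placing all vertices in a single community yields zero, which matches the intuition that modularity rewards partitions that are denser inside than expected.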

Seed-based Flooding. An alternative approach to as- 
signing the nodes of a network to communities was presented 
by da Fontoura Costa in [7]. There, the community detec- 
tion process starts from a set of hub nodes and is imple- 
mented as a parallel flooding process emanating from the 
hubs. Although being seed-based, the technique in [7] is not 
local since it requires simultaneous discovery of all communities
in a network. In contrast, a local method for community finding
was described by Bagrow and Bollt [2]. The authors consider
an expanding neighborhood around the starting node
(which they call $l$-shell) to constitute the community around
it. In order to finish the expansion process, the authors employ
a criterion quantifying the change in the total emerging
degree of the community [2].

Hybrid. A combined strategy for community detection 
is provided by Du et al. [10]. The authors consider a three- 
step community detection process: (a) detection of maximal 
cliques (subgraph enumeration), (b) initial network partition 
by progressive expansion of the maximal cliques (flooding) 
and (c) adjustment of the original partition in order to max- 
imize modularity. 

Most of the methods presented above are global, mean- 
ing that they need to process the whole network in order 
to output the identified community structure. Even though 
some of these methods achieve low complexity (linear to the 
size of the network), their use is still prohibitive, when there 
is need for extremely responsive community detection, e.g. 
in interactive exploration of large networks, which can be 
only feasible by means of local processing of the network. 
We could only find two local methods [2, 29] that are suit- 
able for identifying communities within such applications. 
However, we consider the first of those [2] as unsuitable for
graphs of scale-free nature (since the $l$-shell would contain
the whole graph after just a few expansion steps), and the second
[29] as not achieving maximum efficiency, since it is not
local by design (i.e. there are redundant computation steps
when applying the method locally). We consider that our 
proposed methodology addresses the community detection 
problem from a local perspective in a more intuitive and 
efficient way. 

Most existing community detection methods, to date, have 
been applied to two types of networks: (a) networks created 
from crawling part of the web and (b) networks reflecting the 
social relations and/or interactions among people. Recently, 
there has also been some work highlighting the value of community
detection in tagging systems.³ Part of the case study
in [4], which mainly deals with the evaluation of the effectiveness
of tags as a means to annotate blog articles, describes
the induction of a tag hierarchy by means of a standard hierarchical
clustering scheme based on cosine similarity. In
erarchical clustering scheme based on cosine similarity. In 
another study [27], a method based on Spectral Recursive 
Embedding is proposed to carry out multi-clustering on the 
two bipartite graphs formed by the documents-words and 
documents-tags interrelationships in order to improve the 
precision of tag recommendation. Finally, Cattuto et al. [5] 
exploit the tag overlap between online resources in order to 

³Community detection is frequently termed clustering in the
respective literature. 



identify resource communities by means of spectral methods. 
In this work, we apply our proposed methodology to the tag 
network created from the collective tagging activity of the 
LYCOS iQ users. In that way, we show that the topological 
properties of tag networks can be exploited to extract tag 
groups that are semantically related to each other. 

3. METHODOLOGY 

In this section, we will first (Section 3.1) present the basic 
notations and definitions from graph theory that are nec- 
essary to formalize the problem of community detection. 
Then, we will introduce the Bridge Bounding community 
detection methodology in Section 3.2. 

3.1 Basic notation and definitions 

We consider undirected graphs $G = (V, E)$, where $V$ is the
set of vertices and $E \subseteq V \times V$ is the set of edges connecting
the vertices. An edge connecting nodes $i, j \in V$ is denoted as
$e_{ij}$. For a vertex $s$ of the graph, we consider its neighborhood
$N(s)$ consisting of all vertices which are directly connected
to $s$, i.e. $\forall n \in N(s) : e_{sn} \in E$. We define the degree of vertex
$v$ as $d(v) = |N(v)|$. In a similar way, the neighborhood
of an edge $e_{st}$ consists of all edges that share at least one
endpoint with $e_{st}$, $N(e_{st}) = \{e_{xy} \mid \{x,y\} \cap \{s,t\} \neq \emptyset\}$.
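
The notation above maps directly onto an adjacency-set representation. The following sketch is one possible encoding; the helper names are hypothetical. Read literally, the set definition of the edge neighborhood also contains the edge itself, so the sketch exposes an `include_self` flag, and its default value (`False`) is an assumption of this example.

```python
def build_adjacency(edges):
    """Map each vertex s to its neighborhood N(s) for an undirected edge list."""
    adj = {}
    for s, t in edges:
        adj.setdefault(s, set()).add(t)
        adj.setdefault(t, set()).add(s)
    return adj


def degree(adj, v):
    """d(v) = |N(v)|."""
    return len(adj[v])


def edge_neighborhood(adj, s, t, include_self=False):
    """Edges sharing at least one endpoint with e_st, as frozensets {x, y}."""
    result = {frozenset((x, y)) for x in (s, t) for y in adj[x]}
    if not include_self:
        result.discard(frozenset((s, t)))
    return result


# Hypothetical toy graph: two triangles joined by the edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
adj = build_adjacency(edges)
print(degree(adj, 2), len(edge_neighborhood(adj, 2, 3)))
```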

Global community detection algorithms process a graph 
G in order to partition the graph into a set of communities, 
$P = \{C_0, C_1, \ldots, C_K\}$, where $C_i \subseteq V$. When the communities
produced by a method are mutually exclusive, then
$C_i \cap C_j = \emptyset$, $\forall i,j \in \{1, 2, \ldots, K\}$, with $i \neq j$. During the
community detection process, we consider the set of nodes
$C_U \in P$ comprising all nodes that have not been assigned
to any community until that moment. For convenience, we
also employ the mapping $g_C : V \to P$, which returns the
community a vertex is assigned to (or $C_U$ if the vertex has
not been assigned to any community yet).

Local methods for community detection adopt a seed- 
based approach, i.e. given G and a node s in the graph, 
a local method will produce a community $C_s$ around the
node. It is possible to induce a global community detection
method based on a local one by repeatedly applying the local
community detection method to randomly selected nodes
from $C_U$ until this set is empty (i.e. all nodes of the graph
have been assigned to some community). In the context 
of our evaluation (Section 4), we are going to induce such 
a global community detection scheme by employing the lo- 
cal Bridge Bounding method, which we describe in Section 
3.2. We will refer to this scheme as progressive community 
detection. 
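
The progressive scheme can be sketched independently of the particular local method. In the example below, `local_detect` is a hypothetical callback standing for any local method; `component_of` is only a stand-in used to exercise the loop and is not Bridge Bounding.

```python
import random


def progressive_detection(adj, local_detect, seed=0):
    """Apply a local method to random unassigned seeds until C_U is empty."""
    rng = random.Random(seed)
    unassigned = set(adj)   # plays the role of C_U
    partition = []          # plays the role of P
    while unassigned:
        s = rng.choice(sorted(unassigned))
        community = set(local_detect(adj, s, unassigned))
        # Keep only nodes that are still free, and always keep the seed.
        community = (community & unassigned) | {s}
        unassigned -= community
        partition.append(community)
    return partition


def component_of(adj, s, unassigned):
    """Stand-in local method (assumption): connected component of s."""
    seen, stack = {s}, [s]
    while stack:
        c = stack.pop()
        for n in adj[c]:
            if n in unassigned and n not in seen:
                seen.add(n)
                stack.append(n)
    return seen


# Hypothetical toy graph: two disconnected triangles.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}, 3: {4, 5}, 4: {3, 5}, 5: {3, 4}}
print(progressive_detection(adj, component_of))
```

Because seeds are drawn at random, the order of the communities in the output depends on the random seed, but every node ends up in exactly one community.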

3.2 Community detection by Bridge Bounding

Bridge Bounding is based on a simple strategy in order
to identify the community $C_s$ surrounding a seed node $s$.
A formal description of this strategy is presented below, in
Algorithm 1. Starting from $s$, each node $n$ belonging to the
neighborhood of $s$ is considered a member of $C_s$ as long as
it meets two conditions (line 8 of the algorithm): (a) it is
not already a member of another community and (b) the edge
connecting it to $s$ is not a community boundary, i.e. not a
bridge (in the sense of [8, p. 140]). Then, all neighbors of
the newly assigned nodes (the frontier set $F$) are checked
against the same conditions and are attached to $C_s$ (line 9,
lines 5-6) if they meet them. This process is repeated until
it is not possible to attach additional nodes to $C_s$ (line
3). Thus, Bridge Bounding is equivalent to a flooding process,
similar to the one described in [7], which stops when
all nodes belonging to its frontier are adjacent to a bridge
(community boundary).



Algorithm 1 LocalCommunityDetection
Require: Seed node $s \in G = (V, E)$
Require: Community mapping $g_C : V \to P$
Require: Bridge function $b : E \to [0.0, 1.0]$
 1: $C_s = \emptyset$
 2: Frontier set $F = \{s\}$
 3: while $|F| > 0$ do {$F$ is non-empty}
 4:   $c \leftarrow F$.pop()
 5:   $C_s \leftarrow C_s \cup \{c\}$
 6:   $C_U \leftarrow C_U \setminus \{c\}$
 7:   for all $n \in N(c)$ such that $e_{cn} = (c, n) \in E$ do
 8:     if $g_C(n) = C_U$ and $b(e_{cn}) < B_L$ then
 9:       $F$.push($n$)
10:     end if
11:   end for
12: end while
13: $P \leftarrow P \cup C_s$
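
A minimal Python sketch of Algorithm 1 follows. The set `unassigned` plays the role of $C_U$, `bridging` stands for the bridge function $b$ and `threshold` for $B_L$; all names are assumptions of this example. The `queued` set is an addition that is not part of the pseudocode: it only prevents the same node from being pushed to the frontier more than once.

```python
def local_community(adj, seed, unassigned, bridging, threshold):
    """Grow the community of `seed`, never crossing an edge whose
    bridging value is at or above `threshold`."""
    community = set()
    frontier = [seed]
    queued = {seed}                    # addition: avoid duplicate pushes
    while frontier:                    # line 3
        c = frontier.pop()             # line 4
        community.add(c)               # line 5
        unassigned.discard(c)          # line 6
        for n in adj[c]:               # line 7
            if (n in unassigned and n not in queued
                    and bridging(c, n) < threshold):   # line 8
                frontier.append(n)     # line 9
                queued.add(n)
    return community


# Hypothetical toy graph: two triangles joined by the edge (2, 3),
# with a hand-written bridging function that marks only that edge.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
       3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}


def toy_bridging(u, v):
    return 1.0 if {u, v} == {2, 3} else 0.0


free = set(adj)
print(local_community(adj, 0, free, toy_bridging, 0.5), free)
```

Starting from node 0, the flooding stops at the edge (2, 3), so only the first triangle is returned and the second one remains unassigned.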




(a) Graph $G = (V, E)$ (b) $b_L(e)$ distribution, $e \in E$



Figure 1: Relation of edge position in the graph 
and local bridging probability distribution func- 
tion (pdf). Edges drawn with dashed lines on the 
network of Figure 1(a) are also the ones with the 
highest local bridging values (the part of the distri- 
bution in Figure 1(b) plotted in dashed line). 

The quality of the community structure output by Bridge
Bounding is entwined with the success of quantifying the
bridging behavior of edges. Let us consider the function
$b : E \to [0,1]$, which maps edges to real numbers in the
given interval, to quantify the extent to which they act as
bridges. In order for Bridge Bounding to make a binary
decision on whether an edge $e$ is a bridge or not (in order to
stop or continue the community flooding process along this
edge), the output of the bridging function, $b(e)$, is compared
against some threshold $B_L$ (which can be derived by analysis
of the distribution of $b(e)$ values as will be shown later).

The problem of quantifying the bridging behavior of edges
on a graph has been already studied and several measures
based on graph topology have been developed with the goal
of capturing the extent to which an edge acts as a bridge
between different communities. One of the first attempts
to define $b(e)$ was by means of its betweenness centrality
as described in [23]. For a given edge $e$, its betweenness
centrality is defined as the ratio of the number of shortest
paths running along the edge, $\sigma_{st}(e)$, to the number of all
possible shortest paths $\sigma_{st}$ between $s$ and $t$.

$$b_B(e_{st}) = B(e_{st}) = \frac{\sigma_{st}(e)}{\sigma_{st}} \qquad (1)$$

An extension to this measure, called bridging centrality,
appeared in [16]. Bridging centrality was defined as the rank
product of the edge betweenness (Equation 1) and the edge
bridging coefficient, which made use of the local network
topology to quantify the extent to which an edge acts as a
bridge.

The measures of betweenness and bridging centrality are
global bridging measures, i.e. they are computed by processing
the whole graph. To reduce the computational requirements,
one may consider local bridging measures, e.g.
the edge-clustering coefficient [24]:

$$C^{(3)}_{st} = \frac{z^{(3)}_{st}}{\min[(d(s) - 1), (d(t) - 1)]} \qquad (2)$$

where $z^{(3)}_{st}$ is the number of triangles containing that edge.
Note that the larger the clustering coefficient is, the less the
edge acts like a bridge. Hence, we define the local bridging
of an edge as:

$$b_L(e_{st}) = 1 - C^{(3)}_{st} = 1 - \frac{z^{(3)}_{st}}{\min[(d(s) - 1), (d(t) - 1)]} \qquad (3)$$



In order for $b_L(e)$ to have a low value, the two endpoints
of $e$ need to have a lot of common neighbors (relative to
their degree). Effectively, this means that in order to move
from one of the endpoints to the other, one has multiple
options in addition to $e$. Thus, $e$ is considered as an intra-
(or within-) community edge. In the opposite case, when the
two endpoints of an edge have very few or no neighbors in
common, then this edge is crucial for the connection between
its endpoints. For that reason, we consider in the latter case,
where $b_L(e)$ has a high value, that $e$ is an inter-community
edge or bridge.
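
The local bridging measure can be sketched in a few lines. The sketch assumes that the edge clustering coefficient is the number of triangles through the edge divided by $\min[(d(s) - 1), (d(t) - 1)]$, so that $b_L$ stays in $[0, 1]$. The treatment of edges with a degree-one endpoint (coefficient set to 0, hence $b_L = 1$) is an additional assumption, since the text does not cover that case.

```python
def edge_clustering(adj, s, t):
    """Triangles through e_st, relative to the maximum possible number."""
    triangles = len(adj[s] & adj[t])          # common neighbors of s and t
    denom = min(len(adj[s]) - 1, len(adj[t]) - 1)
    if denom <= 0:
        return 0.0   # assumption: a pendant edge cannot lie on a triangle
    return triangles / denom


def local_bridging(adj, s, t):
    """b_L(e_st) = 1 - edge clustering coefficient of e_st."""
    return 1.0 - edge_clustering(adj, s, t)


# Hypothetical toy graph: two triangles joined by the edge (2, 3).
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
       3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
print(local_bridging(adj, 0, 1), local_bridging(adj, 2, 3))
```

In the toy graph, the edge inside a triangle obtains the lowest possible bridging value, while the edge joining the two triangles has no common neighbors at its endpoints and obtains the highest one.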

In order to derive a decision threshold $B_L$ for identifying
the bridge edges of the graph (see line 8 of Algorithm 1),
one needs to inspect the distribution of $b_L$ values among the
edges of the graph. Figure 1 illustrates how the position of
edges on a graph with community structure affects their local
bridging values. The graph of Figure 1(a) was generated
to comprise a synthetic four-community structure. Edges
that link different communities with each other, i.e. inter-community
edges, are drawn in dashed line. According to
the distribution of Figure 1(b), these edges are characterized
by high $b_L$ values, therefore they can be separated by means
of thresholding from the intra-community edges.

The exact probability distribution function of $b_L$ for a
given graph is available only after computing the local bridging
function for each edge of the graph, introducing in that
way a global graph processing step in the Bridge Bounding
methodology. However, this step does not impose severe
restrictions on the computational process. First, according
to Equation 3, the computation of $b_L$ can be carried out
in a streaming fashion, since only the neighborhoods of the
two endpoints of each edge are required during the computation.
To further reduce the computational requirements,
it is possible to derive an approximation of the $b_L$ probability
distribution by computing the local bridging values of
a small random subset of the network edges. Finally, one




(a) $b_L$ distribution (b) $b'_L$ distribution, $\alpha = 0.7$

Figure 2: Distributions of first-order ($b_L$) and
second-order ($b'_L$) local bridging on the English tag
network of Section 4.2. Note that due to the $b_L$ distribution
shape, it is impossible to select a value for
$B_L$ such that less than 8% of the network edges are
considered intra-community.

could even completely skip the distribution estimation step 
if it has been already performed for a graph of similar na- 
ture in the past (in which case one could reuse the previously 
estimated threshold $B_L$).
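
The sampling-based approximation of the bridging distribution can be sketched as follows. The function names, the default sampling fraction and the quantile-style rule for picking a threshold are assumptions of this example; the text itself only states that the threshold is derived by inspecting the distribution.

```python
import random


def sample_bridging_values(edges, bridging, fraction=0.1, seed=0):
    """Bridging values of a random subset of the edges, sorted ascending."""
    rng = random.Random(seed)
    k = max(1, int(len(edges) * fraction))
    return sorted(bridging(s, t) for s, t in rng.sample(edges, k))


def histogram(values, bins=10):
    """Counts of values in `bins` equal-width bins over [0, 1]."""
    counts = [0] * bins
    for v in values:
        counts[min(int(v * bins), bins - 1)] += 1
    return counts


def threshold_for_fraction(values, intra_fraction):
    """A sampled value below which roughly `intra_fraction` of the sample
    lies (assumed rule; ties make the fraction approximate)."""
    ordered = sorted(values)
    k = min(int(len(ordered) * intra_fraction), len(ordered) - 1)
    return ordered[k]


# Hypothetical bridging values for four sampled edges.
values = [0.0, 0.1, 0.6, 0.9]
print(histogram(values), threshold_for_fraction(values, 0.5))
```

Edges whose bridging value is strictly below the returned threshold would then be treated as intra-community edges by the comparison in line 8 of Algorithm 1.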

The simple measure of local bridging (Equation 3) em- 
ployed by Bridge Bounding is ideal for networks with very 
clear community structure (such as the one of Figure 1(a)). 
However, the measure is often not well-suited for detecting 
communities in real networks. In particular, when a network
is characterized by scale-free topology, the distribution of $b_L$
values will have a spiky shape, similar to the one in Figure
2(a), where the depicted $b_L$ distribution comes from the English
LYCOS iQ tag network of Section 4.2. In such cases,
it is hard to differentiate between bridge and non-bridge
edges. For instance, according to Figure 2(a), 8% of the
network edges have local bridging $b_L = 0$, thus $\forall B_L > 0$,
Algorithm 1 will always consider at least 8% of the network
edges as non-bridges. In networks with scale-free topology (which
commonly emerge in practice), such a decision would cause
Bridge Bounding to detect a community structure that consists
of one large community and many singleton communities,
i.e. communities comprising just one member. The
reason for such an outcome is that scale-free networks maintain
a large connected component even when a large fraction
of their edges are removed⁴ [1]. Figure 3 illustrates the output
of Bridge Bounding on a scale-free graph generated by
the preferential attachment model of Barabási-Albert [3].

In order to alleviate this problem, we consider the second-order
local bridging of an edge $e$, $b'_L(e)$, by computing the
weighted sum (with a mixing parameter $\alpha$) of its local bridging,
$b_L(e)$, and the mean local bridging of the edges constituting
its neighborhood:

$$b'_L(e_{st}) = \alpha \cdot b_L(e_{st}) + (1 - \alpha) \cdot \frac{1}{|N(e_{st})|} \sum_{e \in N(e_{st})} b_L(e) \qquad (4)$$

By applying Equation 4, we carry out a smoothing of the 
local bridging function by taking into account the values 
of the function in the neighborhood around a given edge. 
The $\alpha$ parameter defines the extent to which the values of
the neighboring edges are taken into account in the computation
of $b'_L$. Figure 2(b) illustrates the distribution of $b'_L$
(using $\alpha = 0.7$) for the LYCOS iQ English tag network of
Section 4.2. Low $b'_L$ values are distributed more evenly in
comparison to the $b_L$ ones. Hence, it is possible to select a
value for $B_L$ such that only a very low fraction of edges are
considered as intra-community (~1% in this example).

⁴Although Bridge Bounding does not explicitly remove
edges from the underlying network, it treats bridging edges
as bounds, i.e. as non-existent.

Figure 3: Community structure found by Bridge
Bounding on a 100-node scale-free graph. The structure
consists of one large (red squares), two small
(green circles, yellow squares) and 32 singleton communities
(black circles).
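
The smoothing of Equation 4 can be sketched as below. The sketch makes three assumptions that are not fixed by the text: the first-order measure is one minus the triangle count over $\min[(d(s) - 1), (d(t) - 1)]$, the edge itself is left out of its own neighborhood when averaging, and an edge without neighboring edges keeps its first-order value. All function names are hypothetical.

```python
def local_bridging(adj, s, t):
    """First-order local bridging b_L (assumed form, see lead-in)."""
    denom = min(len(adj[s]) - 1, len(adj[t]) - 1)
    if denom <= 0:
        return 1.0   # assumption for edges with a degree-one endpoint
    return 1.0 - len(adj[s] & adj[t]) / denom


def second_order_bridging(adj, s, t, alpha=0.7):
    """alpha * b_L(e_st) + (1 - alpha) * mean of b_L over neighboring edges.

    Assumes a simple graph (no self-loops)."""
    neighbors = {frozenset((x, y)) for x in (s, t) for y in adj[x]}
    neighbors.discard(frozenset((s, t)))      # assumption: exclude e_st
    own = local_bridging(adj, s, t)
    if not neighbors:
        return own
    mean = sum(local_bridging(adj, *tuple(e)) for e in neighbors) / len(neighbors)
    return alpha * own + (1 - alpha) * mean


# Hypothetical toy graph: two triangles joined by the edge (2, 3).
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
       3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
for s, t in [(0, 1), (0, 2), (2, 3)]:
    print((s, t), round(second_order_bridging(adj, s, t), 3))
```

On the toy graph, the edge (0, 2) has first-order bridging 0 but is adjacent to the joining edge (2, 3), so its second-order value becomes slightly positive, which illustrates how the smoothing spreads the low values over a wider range.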

Effectively, the computation of second-order local bridging
makes use of topological information from a wider neighborhood
around a given edge in comparison to local bridging.
Following this, one could consider the $\nu$-th order local bridging,
$b^{(\nu)}_L$, which for sufficiently high values of $\nu$ would utilize
topological information from the whole graph. Obviously,
since the computation of $b^{(\nu)}_L$ is carried out in an iterative
manner, the complexity of computing the measure increases
with its order $\nu$.

In terms of complexity, a progressive global community
detection scheme based on Bridge Bounding is decomposed
in two steps: (a) local network topology function computation
and (b) community detection. Computing the basic
local bridging measure for a graph of $n$ nodes and $m$ edges
with average node degree $d$ has a complexity of $O(d^2 \cdot m)$
since for each edge, we need to find the intersection of two
sets of average size $d$.⁵ The community detection step has
a complexity of $O(d \cdot n)$, when Algorithm 1 is used in the
global community detection scheme described in Section 3.1, since
for each node of the network $d$ nodes are considered
as candidates for assignment to the community that is
currently being created. Thus, in total, Bridge Bounding
scales with $O(d^2 \cdot m + d \cdot n)$.

4. EVALUATION 

In this section, we present a series of experiments we car- 
ried out in order to gain insights into the performance of 
the proposed approach. The first part of the experiments 

⁵For the computation of higher-order local bridging, the
complexity rises to $O(\nu \cdot d^2 \cdot m)$. However, we consider
that most applications of Bridge Bounding will make use of
second- or at most third-order local bridging functions.




(a) $p_{out} = 0.01$ (b) $p_{out} = 0.08$

Figure 4: Sample synthetic mixtures of communities
generated using the same set of parameters $\{N = 50,
K = 2, z_{tot} = 18\}$ but different values for $p_{out}$.

compares the performance of progressive global community 
detection based on Bridge Bounding with the one achieved 
by the community detection method of Girvan-Newman [15].
This comparison is carried out on synthetic networks with 
known (predefined) community structure, thus giving the 
possibility for objective measurement of the method perfor- 
mance. In the second part of the experiments, we aim at 
gaining insights into real-world complex networks. There- 
fore, we apply our community detection technique on two 
networks created from the user tagging activities in the English
and German versions of the LYCOS iQ question/answering
application. Since there is no ground truth concerning
the community structure of the LYCOS iQ tag networks,
we use our subjective judgement in order to draw conclu- 
sions on the performance of the proposed method. 

4.1 Synthetic networks 

We created a parameterized community mixture generator
following the strategy described in [23] and [20]. According
to this, the generation process results in a network
with $N$ nodes which consists of $K$ communities. We control
the average degree $z_{tot}$ of the network nodes, as well as the
probability $p_{out}$ that a node's edge will connect to a node of
a different community. Thus, out of the $z_{tot}$ edges of each
node (on average), $z_{out} = p_{out} \cdot z_{tot}$ edges connect the node
to nodes of different communities. Obviously, higher values
of $p_{out}$ will lead to networks with less pronounced community
structure. Figure 4 depicts the difference in the conspicuity
of community structure in relation to the fraction $p_{out}$
of inter-community edges. This network generation process
can be described by a four-element parameter set comprising
$N$, $K$, $z_{tot}$, and $p_{out}$. We also consider a fifth parameter,
namely the community size variation $s_{var}$, which is calculated
as the ratio of the biggest community size to the size
of the smallest one. In this case, each community $C_i$ will
have a different average node degree $z^i_{tot}$ and therefore we
define $z_{tot} = \frac{1}{K} \cdot \sum_{i=1}^{K} z^i_{tot}$. In the end, we consider the
five-element parameter set:

$$S_{PAR} = \{N, K, z_{tot}, p_{out}, s_{var}\} \qquad (5)$$
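
A generator of this kind can be approximated by a planted-partition sketch. The version below is not the generator used in the experiments: it assumes equal-sized communities ($s_{var} = 1$, with $K$ dividing $N$) and draws each possible edge independently, with probabilities chosen so that a node has on average $z_{tot} \cdot (1 - p_{out})$ intra-community and $z_{tot} \cdot p_{out}$ inter-community edges. Function and variable names are hypothetical.

```python
import random


def planted_communities(n, k, z_tot, p_out, seed=0):
    """Random graph with k equal-sized planted communities.

    Returns (edges, community_of). Assumes k divides n and that the
    derived edge probabilities do not exceed 1.
    """
    rng = random.Random(seed)
    community_of = {v: v % k for v in range(n)}
    size = n // k
    # Expected intra degree: (size - 1) * p_intra = z_tot * (1 - p_out).
    p_intra = z_tot * (1.0 - p_out) / (size - 1)
    # Expected inter degree: (n - size) * p_inter = z_tot * p_out.
    p_inter = z_tot * p_out / (n - size)
    edges = []
    for u in range(n):
        for v in range(u + 1, n):
            same = community_of[u] == community_of[v]
            if rng.random() < (p_intra if same else p_inter):
                edges.append((u, v))
    return edges, community_of


edges, community_of = planted_communities(200, 4, 40, 0.05)
inter = sum(1 for u, v in edges if community_of[u] != community_of[v])
print(2 * len(edges) / 200, inter / len(edges))
```

The two printed numbers are the realized average degree and the realized fraction of inter-community edges, which should be close to $z_{tot}$ and $p_{out}$ respectively for networks of this size.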

Two widely used measures to evaluate the effectiveness of
data partitioning methods, e.g. community detection, when
the true partition structure is known (which is the case when
testing with synthetic networks) are (a) the fraction $F_C$ of
correctly classified instances [23] and (b) the Normalized
Mutual Information (NMI) introduced in [12] and applied
for the evaluation of community detection in [20]. Consider
two partitions of the $n$-node graph, $P^a = \{C^a_0, C^a_1, \ldots, C^a_{K_a}\}$
(true community structure) and $P^b = \{C^b_0, C^b_1, \ldots, C^b_{K_b}\}$
(community structure found by algorithm). The fraction $F_C$ of
correctly classified instances is straightforward to compute
only when $K_a = K_b = K$. When the true number of communities
$K_a$ differs from the number of communities $K_b$ found
by the algorithm, we need to first identify a subset of the
found communities $P'^b \subseteq P^b$ that can be matched to a subset
of the true communities, $P'^a \subseteq P^a$. We consider two
communities as matching if they present overlap of more
than 50%. Then, assuming that community $C^a_i \in P'^a$ is the
matching community of $C^b_i \in P'^b$, $F_C$ is computed by the
following equation.

$$F_C = \frac{1}{n} \sum_{i} |C^a_i \cap C^b_i| \qquad (6)$$

The Normalized Mutual Information between the true
partition, $P^a$, and the one found by the algorithm, $P^b$, quantifies
the extent to which they are similar to each other from
an information-theoretic point of view [12].

$$NMI(P^a, P^b) = \frac{-2 \sum_{i=1}^{K_a} \sum_{j=1}^{K_b} n_{ij} \log\left(\frac{n_{ij} \cdot n}{n^a_i \cdot n^b_j}\right)}{\sum_{i=1}^{K_a} n^a_i \log\left(\frac{n^a_i}{n}\right) + \sum_{j=1}^{K_b} n^b_j \log\left(\frac{n^b_j}{n}\right)} \qquad (7)$$

In Equation 7, $n^a_i$ and $n^b_j$ denote the number of nodes in
communities $C^a_i$ and $C^b_j$ respectively, and $n_{ij}$ denotes the
number of shared nodes between communities $C^a_i \in P^a$ and
$C^b_j \in P^b$. In general, NMI is preferred to the simplistic $F_C$
measure, since it handles gracefully the cases where $K_a \neq K_b$.
$F_C$ is presented here together with NMI mainly due to
the ease in its interpretation.
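
Both measures can be sketched compactly for partitions given as lists of Python sets. The NMI function follows the standard confusion-matrix form of the measure from [12]. For $F_C$, the sketch assumes that "overlap of more than 50%" must hold relative to both communities of a pair, which guarantees that each community is matched at most once; this reading, the function names and the convention of returning 1.0 when both partitions consist of a single community are assumptions of this example.

```python
from math import log


def nmi(true_part, found_part):
    """Normalized Mutual Information between two partitions (lists of sets)."""
    n = sum(len(c) for c in true_part)
    num = 0.0
    for ca in true_part:
        for cb in found_part:
            n_ij = len(ca & cb)
            if n_ij:
                num += n_ij * log(n_ij * n / (len(ca) * len(cb)))
    den = sum(len(c) * log(len(c) / n) for c in true_part)
    den += sum(len(c) * log(len(c) / n) for c in found_part)
    if den == 0:
        return 1.0   # assumed convention: both partitions are one block
    return -2.0 * num / den


def fraction_correct(true_part, found_part):
    """Fraction of nodes lying in the overlap of matched community pairs."""
    n = sum(len(c) for c in true_part)
    correct = 0
    for ca in true_part:
        for cb in found_part:
            overlap = len(ca & cb)
            # Assumed matching rule: more than 50% of both communities.
            if overlap > 0.5 * len(ca) and overlap > 0.5 * len(cb):
                correct += overlap
    return correct / n


truth = [{0, 1, 2}, {3, 4, 5}]
found = [{0, 1, 2}, {3, 4}, {5}]
print(nmi(truth, truth), nmi(truth, found), fraction_correct(truth, found))
```

In the usage example the found partition splits one true community in two, so the smaller fragment is left unmatched and is counted as misclassified by $F_C$, while NMI penalizes the split more gradually.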

To demonstrate the effectiveness of Bridge Bounding in
detecting the underlying community structure of networks,
we compare the performance of the progressive global
community detection scheme (see Section 3.1) based on Bridge
Bounding, in terms of both $F_C$ and NMI, to the performance
of the community detection method by Girvan and Newman
(GN) [15] on a multitude of synthetic networks. Since
the GN method employs a divisive approach, it results in a
hierarchical community structure, which contains multiple
partitions of the graph into communities. Therefore, we needed
to select a single partition from the hierarchy in order to
evaluate the performance of the method. The strategy
used by Newman and Girvan in [23] to make this selection
is to calculate the modularity Q of each partition and select
the partition that maximizes it.

The modularity of a network partition into K communities
is calculated from the $K \times K$ symmetric matrix $e$ whose
element $e_{ij}$ is the fraction of all edges in the network that
link vertices in community i to vertices in community j.
Further, we define the row (or column) sums
$a_i = \sum_j e_{ij}$, which represent the fraction of edges
that connect to vertices in community i. Based on the above
definitions, the measure of modularity is defined as:

$Q = \sum_{i} (e_{ii} - a_i^2)$ (8)

Table 1: Comparison of performance between a global scheme
based on Bridge Bounding with local bridging (BB), Bridge
Bounding with 2nd-order local bridging (BB') and the method
of Girvan and Newman (GN) [15]. The performance is measured
on synthetic networks generated using the set
$S_{PAR} = \{200, 4, 40, p_{out}, 1.0\}$ of parameters, with
$p_{out}$ being the free parameter.

          F_C (%)                NMI
p_out     BB      BB'     GN     BB      BB'     GN
0.01      100     100     100    1.0     1.0     1.0
0.05      100     100     100    1.0     1.0     1.0
0.1       100     100     50     1.0     1.0     0.86
0.15      100     99      50     1.0     0.98    0.86
0.20      99      74      50     0.98    0.84    0.86
0.25      24      24      -      0.54    0.56    0.02

Table 2: Similar comparison of performance as in Table 1,
but on synthetic networks that were generated using the set
$S_{PAR} = \{200, 4, 40, 0.01, s_{var}\}$ of parameters, with
$s_{var}$ being the free parameter.

          F_C (%)                NMI
s_var     BB      BB'     GN     BB      BB'     GN
1.1       100     100     100    1.0     1.0     1.0
1.5       100     100     100    1.0     1.0     1.0
1.6       99.5    100     100    0.99    1.0     1.0
1.7       88      98      100    0.82    0.96    1.0
1.8       85.5    97      100    0.79    0.95    1.0
1.9       58.5    87      90     0.68    0.82    0.88
2.0       12.5    80      82     0.45    0.73    0.81
2.5       -       62      75     0.45    0.63    0.72

The modularity Q measures the fraction of edges in the
network that connect vertices of the same community (i.e.
intra-community edges) minus the expected value of the same
quantity in a network with the same community partition
but random connections between the vertices. If the number
of intra-community edges is no better than random, we
get Q = 0. For strong community structure (in the limit, many
communities that are completely disconnected from each other
on the graph), Q approaches its maximum value of 1. In
practice, modularity values in the range from 0.3 to 0.7
indicate significant community structure.
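
The modularity of Equation 8 can be computed directly from an edge
list, as the following sketch illustrates. The function name and the
input format (an iterable of undirected edges plus a node-to-community
mapping) are assumptions made for illustration, and every
inter-community edge is assumed to be split evenly between $e_{ij}$
and $e_{ji}$ so that the matrix $e$ stays symmetric.

```python
from collections import Counter
from typing import Dict, Hashable, Iterable, Tuple


def modularity(edges: Iterable[Tuple[Hashable, Hashable]],
               community_of: Dict[Hashable, Hashable]) -> float:
    """Q = sum_i (e_ii - a_i^2) for an undirected, unweighted graph."""
    edge_list = list(edges)
    m = len(edge_list)
    if m == 0:
        raise ValueError("modularity is undefined for a graph without edges")
    intra = Counter()  # number of edges with both ends in community i
    ends = Counter()   # number of edge ends attached to community i
    for u, v in edge_list:
        cu, cv = community_of[u], community_of[v]
        ends[cu] += 1
        ends[cv] += 1
        if cu == cv:
            intra[cu] += 1
    q = 0.0
    for community in ends:
        e_ii = intra[community] / m        # fraction of intra-community edges
        a_i = ends[community] / (2 * m)    # row sum of the symmetric matrix e
        q += e_ii - a_i ** 2
    return q


# Two triangles that are completely disconnected from each other.
triangles = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5)]
labels = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
print(modularity(triangles, labels))
```

For K disconnected communities with equal numbers of edges the sketch
yields Q = 1 - 1/K (0.5 for the two triangles above), so Q gets close
to 1 only when the number of well-separated communities is large.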

We created two sets of networks containing synthetic
communities. The first set was generated by holding the four
other network generation parameters of Equation 5
constant and varying $p_{out}$. This is a widely adopted
process [23, 24, 20] for testing the performance of a community
detection method as the communities of the synthetic graph
gradually become less well-separated. Table 1 presents the
comparison between the performance of Bridge Bounding
(by use of both first- and second-order local bridging) and
the GN method [15]. Both Bridge Bounding methods present
equally good or better performance than GN across the
range of $p_{out}$ values that were used for testing (the only
exception is the NMI of BB' at $p_{out} = 0.20$, which is 0.84
against 0.86 for GN).

A further test involved the generation of an additional set
of networks by varying the $s_{var}$ parameter in order to end
up with networks comprising communities of unequal sizes.
Table 2 provides an overview of the results obtained from
the three methods of our study. Apparently, the use of the
local bridging function (Equation 3) becomes problematic
for Bridge Bounding as soon as the size variation among the
underlying communities exceeds a certain value (e.g. for
$s_{var} \geq 2$, we measured NMI(BB) < 0.5). In contrast,
Bridge Bounding with the use of 2nd-order local bridging
as well as the GN method yielded consistently better results
in this series of tests. Hence, it becomes clear that the use
of more sophisticated local topology measures, such as the
2nd-order local bridging, could be crucial for the success of
the proposed method.

Figure 5: Rank plots of (a) tag frequencies, (b) cooccurrence
frequencies and (c) node degrees for the German and English
LYCOS iQ tag networks.

4.2 LYCOS iQ tag network 

LYCOS iQ is a collaborative question/answering application
where people ask and answer questions on any topic.
The application is available in six languages (German,
English, French, Danish, Swedish and Dutch), with German
attracting the largest community of users. In order to support
the users' efforts of searching for relevant questions, the
application incorporates a tagging functionality, similar to the
one used in typical social tagging systems such as delicious
and flickr. There are no static categories and tags are not
predefined by the system, but the users' inputs are checked
against tags existing in the system database to prevent
duplicates.

Question submitters can attach more than one tag to each
of their questions. Therefore, it is possible to create a tag
network from the collaborative tagging activities of users.
In this network, the vertex set comprises the tags chosen by
users to tag their questions and the edge set contains the
co-occurrences between tags in the users' questions. When a
question is tagged with more than two tags, all possible
pairwise co-occurrences are added to the network. For each
tag of the network, its frequency (tf) is available. Further,
the co-occurrence frequency (cf) between each pair of tags
is available.
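
The construction described above can be sketched as follows. This is
a minimal illustration rather than the pipeline used for LYCOS iQ:
the function name and the input format (one list of tags per
question) are assumptions, and a tag that is repeated within the same
question is assumed to count once.

```python
from collections import Counter
from itertools import combinations
from typing import Iterable, List, Tuple


def build_tag_network(questions: Iterable[List[str]]
                      ) -> Tuple[Counter, Counter]:
    """Return (tf, cf): tag frequencies and pairwise co-occurrence counts."""
    tf = Counter()  # vertex attribute: number of questions using the tag
    cf = Counter()  # edge attribute: number of questions using both tags
    for tags in questions:
        unique_tags = sorted(set(tags))  # sorted so each pair has one key
        tf.update(unique_tags)
        # All possible pairwise co-occurrences within one question.
        for tag_a, tag_b in combinations(unique_tags, 2):
            cf[(tag_a, tag_b)] += 1
    return tf, cf


questions = [
    ["music", "pop", "artists"],
    ["music", "pop"],
    ["history"],
]
tf, cf = build_tag_network(questions)
print(tf["music"], cf[("music", "pop")], len(cf))
```

Storing each pair under a sorted key keeps the edge set undirected,
and a question with a single tag (such as "history" here) adds a
vertex but no edge.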

delicious: http://delicious.com
flickr: http://flickr.com

Table 3: Tag networks used in this case study. For
each network $G = \{V, E\}$, $|V|$ = tags and $|E|$ =
tag-pairs.

                tags      tag-pairs   questions
English (UK)    9,517     77,243      62,497
German (DE)     78,138    896,486     942,405

Figure 5 illustrates the rank plots of tag and cooccurrence
frequencies as well as of the node degrees observed
in the German and English LYCOS iQ tag networks. A
highly skewed behavior is obvious in the tagging activities
of users, e.g. in the English dataset, a small set of tags is
used very frequently (hundreds of times), while the majority
of them are used fewer than 10 times. The frequency of
cooccurrence between tags follows a similar pattern, with
fewer than a thousand tag pairs occurring together in more than
a few questions. Finally, the node degrees follow a long-tail
distribution, indicating that the tag networks are
characterized by scale-free topology. A considerable number of
tags are even disconnected from the rest of the network, meaning
that they were used in isolation. To reduce the amount of
noisy tags in the network, we filtered out tags that were
either disconnected or appeared fewer than two times in the
dataset.
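
The filtering step can be sketched as below. The function and
parameter names are hypothetical, and the sketch assumes a single
pass: a tag that becomes disconnected only because its neighbours
were removed is kept, which may differ from the procedure actually
applied to the LYCOS iQ data.

```python
from typing import Dict, Tuple

TagPair = Tuple[str, str]


def filter_tags(tf: Dict[str, int], cf: Dict[TagPair, int],
                min_tf: int = 2) -> Tuple[Dict[str, int], Dict[TagPair, int]]:
    """Drop tags that have no edge or that occur fewer than min_tf times."""
    # Tags that take part in at least one co-occurrence edge.
    connected = {tag for pair in cf for tag in pair}
    kept = {tag for tag, freq in tf.items()
            if freq >= min_tf and tag in connected}
    filtered_tf = {tag: tf[tag] for tag in kept}
    # Keep only the edges whose two endpoints both survive.
    filtered_cf = {pair: count for pair, count in cf.items()
                   if pair[0] in kept and pair[1] in kept}
    return filtered_tf, filtered_cf


tf = {"music": 3, "pop": 2, "rare": 1, "alone": 5}
cf = {("music", "pop"): 2, ("music", "rare"): 1}
print(filter_tags(tf, cf))
```

Here "rare" is removed for appearing only once and "alone" for having
no co-occurrence edge, even though it is used frequently.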

Table 3 provides a summary of the two tag networks that
we obtained after the aforementioned filtering step. The tag
network induced from the tagging activity on the German
version of LYCOS iQ is far larger than the one created from
the English version. Nevertheless, we preferred to present
community snapshots and examples only from the English
tag network to ensure that even readers who are not familiar
with German can understand them. Since the proposed
community detection method relies only on information
regarding the network topology, the outcome of the method
is language independent. We confirmed this intuition
by inspecting the community detection results on both tag
networks.

Figure 6 provides a high-level view of the most prominent
topics coming up through the users' questions in the LYCOS
iQ application. The tags depicted in this view were selected
based on their degree in the network. Although the resulting
network is very densely connected, one can already see that
all tags (apart from the pair "IQ"-"GENERAL INTEREST")
belong to different communities (since the dashed edges have
been found to be inter-community edges, after thresholding
based on the $b'_L$ distribution of the network in Figure 7).

In order to explore the topic structure of the tag network 
in more depth, we selected some of the top-level tags as seed 
nodes and inspected the resulting communities. Figures 8 
and 9 present the communities around tags "computers" and 
"history". In both figures it is apparent that most edges are 
considered intra-community. Also, note that while the "com- 
puters" community is densely connected, the "history" com- 
munity resembles a star-shaped graph: it remains connected 
through its central tag ("history"). 

Four additional tag communities are depicted in Figure
10. The complexity of their structure depends on the topic
of the respective community. For instance, the community
formed around the tag "music" (Figure 10(a)) has a much
simpler structure than the one created using "science" as the
seed tag (Figure 10(b)). There are two possible reasons for
this: (a) science is a more general topic than music,
containing sub-topics such as physics, medicine, biology and
astronomy (these correspond to the four large nodes of
Figure 10(b)), (b) the questions submitted by LYCOS iQ users
(and consequently the tags used to describe them) are more
focused on particular aspects of music, e.g. pop music artists.

Figure 6: Overview of the English LYCOS iQ tag
network. We use the following conventions for tag
network visualizations: (a) Font size is proportional
to tag frequency, (b) Edge thickness is proportional
to cooccurrence frequency, (c) Edges identified as
bridges are drawn in dashed line.

Figure 7: Distribution of $b'_L$ values in the two
LYCOS iQ tag networks.

Figure 8: Community around tag "computers".

Figure 9: Community around tag "history".

Further, a noteworthy observation regarding the structure
of the communities around "film" (Figure 10(c)) and
"animals" (Figure 10(d)) is the existence of small cliques
(between 3 and 5 members) within them. Those correspond
to tags related to particular films in the "film" community
(e.g. "batman"-"Christian Bale"-"comic") or tags related to
groups of animals (e.g. "leopards"-"panthers"-"mammals")
in the "animals" community. This indicates the existence
of semantic hierarchies within topics (e.g. "mammals" are a
subclass of "animals"; "leopards" and "panthers" are subclasses
of "mammals"), which could be further validated by means
of machine learning techniques [30].

As stated earlier, detecting the topic communities within
a tag network similar to the one created from the LYCOS iQ
application (nowadays, there are plenty of Web 2.0
applications incorporating collaborative tagging characteristics)
can be beneficial for both the users and the administrators
of the application. Users can be provided with a community
view of the tags that are related to their context. For
instance, when a LYCOS iQ user submits a question to the
system, the text of her question can be parsed and matched
against the tags already available in the system. Then, by
identifying the community (or communities) that her question
belongs to, it is possible to recommend relevant tags for
use as descriptors of the question, or relevant questions that
have been tagged with tags belonging to the respective
community. Further, administrators of such applications could
use community detection in the context of a content
monitoring and trend tracking framework to support
important administrative tasks, e.g. online ad
targeting or content moderation (which is most frequently
synonymous with spam detection).
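
The tag recommendation scenario outlined above could look roughly as
follows. This is a hypothetical sketch and not a feature of LYCOS iQ:
the function name, the `community_of` mapping (tag to community
label, as produced by any community detection run) and the naive
word-level matching of single-word tags are all assumptions.

```python
import re
from typing import Dict, List


def recommend_tags(question_text: str, community_of: Dict[str, str],
                   limit: int = 5) -> List[str]:
    """Suggest tags from the communities of the tags found in the text."""
    # Naive parsing: lower-cased alphanumeric words of the question.
    words = set(re.findall(r"[a-z0-9]+", question_text.lower()))
    matched = {tag for tag in community_of if tag in words}
    communities = {community_of[tag] for tag in matched}
    # Candidate tags share a community with a matched tag.
    candidates = sorted(tag for tag, community in community_of.items()
                        if community in communities and tag not in matched)
    return candidates[:limit]


community_of = {
    "leopards": "animals", "panthers": "animals", "mammals": "animals",
    "batman": "film", "comic": "film",
}
print(recommend_tags("Are leopards faster than lions?", community_of))
# → ['mammals', 'panthers']
```

A practical system would presumably rank the candidates, for example
by tag frequency or by cooccurrence with the matched tags, instead of
returning them in alphabetical order.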

5. CONCLUSIONS 

We introduced Bridge Bounding, a local methodology for
community detection in large networks. The methodology is
based on the notion of local network topology functions to
quantify the extent to which edges act as community
boundaries, i.e. bridges. We showed that use of local bridging, a
topology function based on the widely used edge clustering
coefficient, resulted in successful discovery of existing
community structure in synthetic networks, but failed to do
so in networks of scale-free topology. For that reason, we
employed the second- and higher-order local bridging
functions to derive smoother estimates of the bridging properties
of edges. The proposed methodology is efficient,
scaling with $O(d \cdot m + d \cdot n)$ for networks of n nodes
and m edges with average node degree d.

Figure 10: Further examples of community shapes: (a) Music,
(b) Science, (c) Film, (d) Animals. The presented communities
were created using "music", "science", "film" and "animals"
as seed nodes.

A series of tests on synthetic networks with controlled
community structure provides evidence that the Bridge
Bounding method (with use of the 2nd-order local bridging
function) performs equally well or better than the widely used
method of Girvan and Newman. Moreover, application of
our method on two large tag networks coming from the
LYCOS iQ question/answering application proved beneficial in
studying the underlying topic structure and can benefit both
users and administrators of Web 2.0 applications with social
tagging features.

In the future, we plan to carry out more thorough eval- 
uation tests on the tag communities produced by Bridge 
Bounding. Specifically, we plan to conduct a user study 
among selected LYCOS iQ users in order to derive man- 
ual judgements on the quality of the detected communities. 
Subsequently, we are going to consider the potential of new 
edge bridging functions and of more sophisticated strategies 
for community detection based on Bridge Bounding. Instead 
of the currently employed fixed-threshold strategy for decid- 
ing whether an edge is intra- or inter-community, we will test 
the potential of adaptive threshold strategies. Finally, we 
intend to look into extensions that will endow the method 
with capabilities for uncovering hierarchical relations within 
the community structure. 